Development of a prediction model for the depression level of the elderly in low-income households: using decision trees, logistic regression, neural networks, and random forest

Korea has the fastest-aging population in the world, and there is strong nationwide interest in the elderly population. Among common chronic conditions, depression has a high incidence in the elderly, so preventing it in advance is vital. Hence, this study aims to select the factors related to depression in low-income seniors identified in previous studies and to develop a prediction model. A total of 2975 elderly people from low-income households were extracted from the 13th-year data of the Korea Welfare Panel Study (2018). Among the numerous data mining techniques, decision trees, logistic regression, neural networks, and random forest were applied to develop the predictive model. In addition, the wrapper’s stepwise backward elimination, which finds the optimal model by removing the least relevant factors, was applied. The models were evaluated by accuracy. First, the final prediction model based on a decision tree showed the highest predictive power, with an accuracy of 97.3%. Second, psychological factors (leisure life satisfaction, social support, subjective health awareness, and family support) ranked higher than demographic factors in influencing depression. Based on the results, an approach focused on psychological support is much needed to manage depression in low-income seniors. Because depression in the elderly depends on numerous influencing factors, a decision tree may be beneficial for establishing a firm prediction model that identifies the vital factors contributing to depression in the elderly population.

www.nature.com/scientificreports/ standard median income is the income of the person in the middle when all people are lined up by income, and the KOWEPS considers 60% or less of it to be low income. As mentioned above, low-income elderly people are reported to have poorer health outcomes than other populations and are significantly affected by their surroundings. Therefore, the 13th-year (2018) data were chosen to exclude any potential unrelated effects of the COVID-19 era; the first case of COVID-19 in South Korea was in January 2020 25 .

Construction of variables. Target variable.
The KOWEPS provides the CES-D 11 (Center for Epidemiological Studies-Depression Scale) as a measure of depression. The scale was constructed by reducing the 20-item instrument developed by Radloff (1977) to 11 items 26 . The instrument consists of the following items: I did not feel like eating, my appetite was poor; I felt that I was just as good as other people; I felt depressed; I felt that everything I did was an effort; My sleep was restless; I felt lonely; I enjoyed life; People were unfriendly; I felt sad; I felt that people dislike me; and I could not get "going." Responses ranged from 0 (rarely or none of the time) to 3 (most or all of the time). In this study, the total score on the 11 items was multiplied by 20/11 to place it on the scale of the original 20-item instrument, and the converted score was used to determine whether or not there was depression. The higher the value, the higher the level of depression indicated. Depression can be suspected if the score is 16 points or more, and a score of less than 16 can be considered normal.
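The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's own code: the function names are hypothetical, and the item scores are assumed to be already coded 0 to 3 in the depressive direction (the text does not describe how positively worded items such as "I enjoyed life" were handled).

```python
def cesd11_to_20_scale(item_scores):
    """Convert 11 CES-D item scores (each 0-3) to the 20-item scale.

    Follows the rule in the text: total score multiplied by 20/11.
    Assumption: items are already coded in the depressive direction.
    """
    if len(item_scores) != 11:
        raise ValueError("expected exactly 11 item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(item_scores) * 20 / 11


def is_suspected_depression(item_scores, cutoff=16):
    """Return True when the converted score reaches the cut-off (16 in the text)."""
    return cesd11_to_20_scale(item_scores) >= cutoff


if __name__ == "__main__":
    # Hypothetical respondent: nine items scored 1, two items scored 0.
    responses = [1] * 9 + [0] * 2
    print(round(cesd11_to_20_scale(responses), 2))  # 9 * 20 / 11 = 16.36
    print(is_suspected_depression(responses))       # True
```

A raw total of 9 is therefore the smallest 11-item sum that reaches the cut-off of 16 after conversion, since 8 × 20/11 is about 14.5.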
Input variable. Based on the literature review discussed above, the input variables used in this study are as follows. First, gender, age, education level, number of household members, disability, economic activity, and chronic disease were included as demographic factors. Second, social support, family support, and leisure life satisfaction were each measured on a four-point Likert scale; the higher the score, the higher the support and satisfaction. Third, health promotion behavior is a concept that encompasses various factors, such as beliefs, behaviors, and habits necessary for health promotion and maintenance. However, this study was limited to the health behavior and lifestyle factors provided by the KOWEPS. Drinking, based on the item 'the average amount of alcohol consumed per year', was scored 1 point if there was no drinking experience at all and 0 points if there was drinking experience at least once. For smoking, based on the item 'currently smoking cigarettes', 0 points were given to smokers and 1 point to nonsmokers. The health checkup was scored 0 points if it had never been done and 1 point if it had been done; the higher the score, the more health behaviors the participant had. Fourth, subjective health awareness was measured on a four-point Likert scale, with a higher score indicating higher subjective health awareness, and the level of medical expenditure is the average monthly medical expenditure. The factors used in the analysis are summarized in Table 1.
Statistical analysis. Frequency analysis, t-tests, and one-way ANOVA were performed to verify whether statistical differences occurred according to the demographic characteristics and depression level of the participants of this study. Then, the data mining techniques of logistic regression, decision tree, artificial neural network, and random forest analysis were used to build a predictive model for depression in the elderly of low-income households. A sensitivity analysis was conducted to ensure that the main outcome was reliable and robust; it was carried out by changing the cut-off score for suspected depression, the dependent variable. Logistic regression analysis is the most common method used when the target factor is binary, and it has the advantage of handling a target that takes only the values 0 and 1. An artificial neural network is one of the most widely used methodologies; it predicts the category of the target factor by combining input factors in a nonlinear model, passing them to each hidden unit, and delivering the combination of hidden units to the output node. Decision tree analysis is a technique that classifies the categories of the target factor by charting decision-making rules in the form of a tree structure. Because the result is expressed as a tree, the classification is easy to interpret, and information on the major predictive factors can be obtained. In this study, C5.0, one type of decision tree, was used. Random forest is a model that addresses the shortcomings of the decision tree and is reported to have excellent performance because it can prevent overfitting by applying the bagging technique to generate multiple decision trees 27 . Finally, logistic regression analysis was conducted to identify the predictors of high risk of depression.
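To illustrate how a logistic regression predictor is read as an odds ratio (OR), the sketch below converts a coefficient and its standard error to an OR with a 95% confidence interval, and back. It is an illustration under stated assumptions, not the SPSS or SAS procedure used in the study: the function names are hypothetical, a Wald-type interval (coefficient ± 1.96 standard errors) is assumed, and the coefficient is back-calculated from the gender result reported in the Results (OR = 1.861, 95% CI = 1.173-2.954) purely as an example.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic regression coefficient.

    Assumes a Wald-type interval: exp(beta +/- z * se).
    """
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))


def coefficient_from_or(or_value, ci_low, ci_high, z=1.96):
    """Back-calculate (beta, se) from a reported OR and its confidence interval."""
    beta = math.log(or_value)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


if __name__ == "__main__":
    # Reported gender effect, used here only as an illustration.
    beta, se = coefficient_from_or(1.861, 1.173, 2.954)
    or_value, low, high = odds_ratio_with_ci(beta, se)
    print(f"beta={beta:.3f}, se={se:.3f}")
    print(f"OR={or_value:.3f}, 95% CI={low:.3f}-{high:.3f}")
```

An OR above 1 indicates higher odds of depression for a one-unit increase in the predictor, and an OR below 1 indicates lower odds; the round trip reproduces the reported interval only approximately because published values are rounded.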
For the development and evaluation of the predictive model, a tenfold cross-validation method was used to support generalization: the entire data set was divided into ten parts, of which nine were used for model creation and one for validation 28 . The relative importance of the predictive factors that contributed to predicting the depression level of the elderly in low-income households was examined via Shapley additive explanation (SHAP) analysis, and wrapper's stepwise backward elimination was then applied to find the optimal model by removing the least relevant factors. The models created through this process were evaluated based on accuracy, and the optimal model for this topic was then selected.
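The two procedures described above, tenfold cross-validation and backward elimination of the least relevant factors, can be sketched as follows. This is a simplified illustration in plain Python, not the IBM SPSS Modeler workflow used in the study: the function names are hypothetical, the folds are taken in order without shuffling, and `evaluate` stands in for whatever routine trains a model on a factor subset and returns its cross-validated accuracy.

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_validation_splits(n_samples, k=10):
    """Yield (train, validation) index lists; each fold serves as validation once."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def backward_elimination(ranked_factors, evaluate):
    """Drop the least important factor one at a time and keep the best subset.

    ranked_factors: factor names ordered from most to least important.
    evaluate: callable taking a list of factors and returning an accuracy.
    """
    current = list(ranked_factors)
    best_factors, best_score = list(current), evaluate(current)
    while len(current) > 1:
        current = current[:-1]  # remove the least relevant remaining factor
        score = evaluate(current)
        if score > best_score:
            best_factors, best_score = list(current), score
    return best_factors, best_score


if __name__ == "__main__":
    sizes = [len(fold) for fold in k_fold_indices(2975, k=10)]
    print(sizes)  # five folds of 298 records followed by five folds of 297
```

In practice the records would usually be shuffled (and often stratified by the target) before being split into folds; that step is omitted here to keep the sketch short.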
For the performance index of the developed prediction model, a larger value indicates stronger predictive power for the depression level. The model's final evaluation was based on accuracy, and sensitivity and specificity values were also presented. The analysis packages IBM SPSS Modeler 18.0 (SPSS Inc., Chicago, Illinois, USA) and SAS 9.4 (SAS Institute Inc., Cary, NC) were used.
Ethical approval. This study was approved by the Korea University Institutional Review Board (IRB No. IRB-2022-0385). The IRB of Korea University waived informed consent since this study was retrospective and the personal information in the data was blinded.

Results
The demographic characteristics of the participants and the mean differences in depression levels are shown in Table 2. Females (n = 2008, 67.5%) outnumbered males (n = 967, 32.5%). For the age distribution, 'ages of 80 or older' was the largest group with 1475 people (49.6%), followed by 'ages of 75-79' with 755 people (25.4%) and 'ages of 70-74' with 459 people (15.4%). Regarding the level of education, 2107 people (70.8%) had 'elementary school graduation', and 471 people (15.8%) had 'middle school graduation', indicating that the majority had a low level of education. For the number of household members, 'two people' accounted for the largest portion with 1459 people (49.0%), showing that nearly half lived with one other person. As for having disabilities, 2425 people (81.5%) answered 'no', and 550 people (18.5%) answered 'yes'. In terms of participation in economic activities, 'not participating' accounted for more than half of the participants (n = 2042, 68.6%). In terms of depression, 2553 people (85.8%) reported 'no' and 422 people (14.2%) reported 'yes.' Lastly, the mean difference in depression level according to the sociodemographic characteristics of the participants was evaluated. There were significant differences by gender (t = −3.547, p < 0.001) and by participation in economic activities (F = 7.326, p < 0.001), but no differences were found for the other factors.
The descriptive statistics of the main factors are shown in Table 3. Considering that the score for health promoting behavior ranges from a minimum of 0 to a maximum of 3, an average of 2.2 points can be regarded as a high value. With a standard deviation of 0.72, there was little variation in health promotion behavior among the elderly in low-income households. With reference to subjective health awareness, the average was 2.8 points, indicating that many elderly people had above-average awareness. Regarding the level of medical expenses, the average monthly expenditure was 158,000 won, and the standard deviation was 20.17, indicating wide variation in expenditure among the elderly in low-income households. Family support, social support, and leisure life satisfaction showed average scores of 2.7, 2.6, and 2.3, respectively, which can be considered good, given that the scores ranged from 0 to 4.
The relative importance of the predictive factors that contributed to predicting depression in low-income seniors, obtained through feature selection, is shown in Table 4. The higher the order of importance of a predictor, the greater the influence of that factor in predicting the level of depression; the highest-ranking factor was 'leisure life satisfaction.' This result can be interpreted to mean that leisure life satisfaction has a greater effect than the other factors when predicting the level of depression of the elderly in low-income households. Furthermore, subjective health awareness, family support, and social support were found to be in the upper ranks. However, the presence or absence of chronic diseases, educational level, disability, and health behavior were distributed in the lower ranks. A SHAP summary plot was created (Fig. 2) to visualize how much each explanatory variable affects the prediction of depression. A yellow bar indicates a positive influence on the occurrence of depression. The red and orange bars indicate a negative impact on the occurrence of depression. The red bars were found to be the most influential variables. Leisure life satisfaction can be used as an explanatory or a dependent variable. This study used it as an explanatory variable because the subjects were low-income elderly. The relationship between leisure life satisfaction and depression in low-income elderly is often reported as causal, with leisure life satisfaction affecting depression 29 .
Table 2. General characteristics of the participants and differences in depression level. p* < .05, p** < .01, p*** < .001.
In this study, the classification techniques used to develop the most accurate model for predicting the level of depression of the elderly in low-income households were artificial neural networks, decision trees, logistic regression, and random forest analysis. Table 5 shows the results of the classification analysis obtained by sequentially applying the wrapper's stepwise method to the factors ranked by relative importance in Table 4. The decision tree algorithm showed higher predictive power than the other three algorithms. The prediction accuracy of the logistic regression analysis was 73.2%, and that of the artificial neural network was 81.8%. The decision tree, on the other hand, tended to increase in predictive accuracy as the number of factors increased, except when there was only one input factor. When all 13 factors were input, it achieved an accuracy of 97.3%, a sensitivity of 100%, and a specificity of 94.6%. Finally, the factor that had the greatest impact when forming the decision tree was subjective health awareness, followed by leisure life satisfaction, family support, and social support. To ensure that the main outcome was reliable and robust, a sensitivity analysis was conducted by dividing the dependent variable, depression incidence, at two thresholds (15 points or less, 16 points or more); the analysis revealed that the main outcome did not change (Tables 6, 7).
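The three measures reported above (accuracy, sensitivity, and specificity) are all derived from a confusion matrix. The sketch below shows the standard definitions in plain Python; the function name and the toy labels are hypothetical and are not the study's data.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for binary labels.

    Labels are 1 (suspected depression) and 0 (normal).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true cases that are detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
    }


if __name__ == "__main__":
    # Toy example: 3 depressed and 7 non-depressed participants.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    for name, value in classification_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

Because accuracy depends on the class balance of the evaluated sample, sensitivity and specificity are usually reported alongside it, as they are here.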
Logistic regression analysis was performed to examine the influence of the predictors of high risk of depression in the elderly from low-income households, and the results are shown in Table 8. The factors that affected the level of depression were gender, number of household members, subjective health awareness, family support, social support, and satisfaction with leisure life. For gender, the odds of depression in women were 1.86 times (OR = 1.861, 95% CI = 1.173-2.954) those in men. For each one-level increase in the number of household members, the odds of depression were 0.69 times as high (OR = 0.692, 95% CI = 0.513-0.933). For subjective health awareness, each one-level increase was associated with odds of depression 0.40 times as high (OR = 0.403, 95% CI = 0.312-0.522). Further, family support (OR = 0.613, 95%

Discussion
This study analyzed the factors affecting depression in the elderly from low-income families, using the KOWEPS data and drawing on the literature review mentioned above. The study first determined whether the factors are related to depression in the elderly of low-income families and then developed a model to predict depression. The analysis showed that the decision tree had the highest accuracy as a model for predicting depression among the elderly from low-income families, and the factors that most influenced the formation of the model were mainly psychological. The main findings are as follows. First, when the wrapper's stepwise removal method was applied sequentially to the factors ranked by relative importance for predicting depression in the elderly from low-income families, the decision tree analysis showed the highest predictive power (97.3%). This result is consistent with previous studies reporting that decision trees show excellent results in developing predictive models. Lee et al. used artificial neural networks, logistic regression analysis, and decision trees (C5.0, CART, QUEST) to develop a model that predicts patient satisfaction and revisit intention according to hospital visits; the decision trees showed the highest predictive power, and C5.0 showed excellent results 30 . Moreover, decision trees (C5.0, CHAID, and QUEST) were used in a model development study that predicts whether patients with severe work histories are admitted to the intensive care unit, and C5.0 showed the best predictive power 31 . The decision tree (C5.0) has the advantage of an algorithm that can handle complex relationships between predictors more effectively, and it is widely used in the healthcare field. More importantly, it is known as one of the data mining classification techniques with proven effectiveness 32 .
It is expected that effective depression management services can be provided by detecting groups with a high risk of depression at an early stage. Further refinement of the model to include additional community infrastructure and geographic factors related to depression may lead to more diverse measures to prevent depressive problems among low-income elderly.
Second, when the decision tree (C5.0) was formed, subjective health awareness, leisure life satisfaction, family support, and social support were the factors that had a relatively large influence. This outcome is supported by a study reporting that depressive disorder in the elderly is on the rise worldwide and that psychological factors such as social support and subjective health awareness are key contributing factors 33 . Another study reported that life satisfaction and subjective health awareness have the most significant influence 34 . Depression in the elderly has been shown to have a significant psychological impact, and decision trees are reported to be a highly effective method 35,36 . To prevent and manage depression in the elderly, it is necessary to recognize the need for policy support that considers psychological factors (subjective health awareness, leisure life satisfaction, family support, and social support). For example, adequate mental health management can be provided by conducting free quarterly psychological examinations of low-income elderly people at public health centers and local clinics in each region to detect risk groups for depression, while developing and operating programs to increase psychological support in community service centers.
Third, a logistic regression analysis was conducted to confirm the predictors of depression in the low-income elderly. As a result, gender, the number of household members, subjective health awareness, leisure life satisfaction, family support, and social support were identified as influencing factors. The risk of depression was higher for women, and it was higher the smaller the number of household members, the lower the satisfaction with leisure life, the lower the family support and social support, and the lower the level of subjective health awareness. These results are in line with previous studies 18,35-37 . The level of depression according to income level can also be examined. Muhammad et al. reported that the elderly population in the poorest fifth quintile was 39% more likely to develop depression than the elderly in the first quintile 38 . Thus, it can be presumed that depression in the elderly is not caused by a single factor but by a combination of various factors. Accordingly, creating activities in the local community in which senior citizens can participate, such as senior universities and clubs, while actively promoting them and encouraging participation, is considered a way to prevent depression in the long run. All activities could be provided free of charge considering the characteristics of the low-income elderly, and if necessary, participation could be encouraged by offering a subsidy. In South Korea, various psychological support programs for the elderly exist in different regions, so it would be more effective to form a network to establish and manage roles and functions across regions.
Table 8. The results of logistic analysis according to the level of depression of the elderly in low-income households. p* < .05, p** < .01, p*** < .001, OR: odds ratio, CI: confidence interval.
For example, the community service centers in each region could act as gatekeepers that identify groups of people who are likely to be depressed and encourage them to participate in community-based psychological support programs. Finally, the limitations of this study are as follows. First, not all of the various factors affecting depression in the elderly were examined. Previous studies have shown that various factors, such as biological, cultural, and environmental factors, act in combination to affect depression; however, the current study did not include all of these factors due to data limitations. Second, this study was conducted as a cross-sectional study, so it is difficult to identify causal relationships over time. Third, in terms of the influences on depression, the characteristics of the different stages of old age were not considered. Since old age is now classified into early, middle, and late stages, each with its own characteristics, it is highly likely that different patterns will appear in the factors influencing depression and in the size of their impact.

Conclusion
This study selected factors related to depression in the elderly from low-income families identified in previous studies and developed a prediction model for depression in this population. The results showed that psychological factors (leisure life satisfaction, subjective health awareness, family support, and social support) ranked higher than demographic factors, and the most suitable predictive model was identified as a decision tree. These results suggest that an approach focused on psychological support is needed to manage the level of depression in low-income seniors. More importantly, because the factors influencing depression vary across the elderly population, a decision tree will be beneficial for establishing a more concrete prediction model.